Which Australian city has the cleanest air?

Article
Author

Adhni Mulachela

knitr::include_graphics("pollution.jpg")

Introduction

Air pollution is one of the most serious issues worldwide because of its adverse effects on human health and the environment. Monitoring air quality is crucial, and it typically involves measuring concentrations of the following pollutants, which are common in the air and pose risks to human health:

  • PM2.5: Fine particulate matter with an aerodynamic diameter of less than 2.5 μm. These particles are small enough to lodge deep in the lungs and sometimes enter the bloodstream, where they contribute to respiratory and cardiovascular diseases.

  • PM10: Particles with a diameter of 10 micrometres or less, which can irritate and harm humans. PM2.5 is the finer fraction of PM10; both can be inhaled and lead to respiratory ailments.

  • SO2 (Sulfur Dioxide): A gas produced by burning coal, oil and other sulfur-containing fuels. High concentrations of SO2 affect the respiratory system and contribute to acid rain.

  • O3 (Ozone): Ground-level ozone is one of the most dangerous air pollutants. It forms through chemical reactions between nitrogen oxides (NOx) and volatile organic compounds (VOCs) in sunlight. Ozone causes respiratory problems and worsens asthma.

  • NO2 (Nitrogen Dioxide): A gas emitted mainly by vehicle exhaust and factories. NO2 inflames the respiratory tract, worsens respiratory diseases and contributes to the formation of ozone.

  • CO (Carbon Monoxide): A colourless, odourless, toxic gas produced by the incomplete combustion of hydrocarbons and other carbon-containing materials. High levels of carbon monoxide reduce the blood’s ability to carry oxygen, which leads to health complications.

This report illustrates the diurnal and spatial distribution of these pollutants and compares their concentrations across the selected cities.

To answer the question “Which Australian city has the cleanest air?”, we first needed to determine which cities to consider and how to define the boundaries for each city (i.e., which sensors to include). To keep the analysis clean and useful, we focused on the five most populous Australian cities.

To ensure consistent and fair comparisons across cities, we defined the boundaries as a 10 km radius from the central station of each city. All air quality sensors located within this radius were included in the analysis, as can be found in the table below.

City | State | Population | Central_Station_Coordinates | Air_Stations | Air_Station_Count
Melbourne | Victoria | 4929201 | -37.814, 144.963 | Moonee Ponds (Mason), Footscray, Kingsville, Brooklyn, Spotswood, Altona North, Melbourne CBD, Alphington | 8
Sydney | New South Wales | 4892217 | -33.883, 151.206 | Anzac Memorial, Luna Lewisham, Sydney, Australia | 3
Brisbane | Queensland | 2545882 | -27.467, 153.017 | Brisbane CBD, South Brisbane, Woolloongabba, Cannon Hill, Rocklea | 5
Perth | Western Australia | 2205223 | -31.950, 115.861 | Subiaco | 1
Adelaide | South Australia | 1399088 | -34.928, 138.601 | CBD West | 1
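
The 10 km radius rule described above can be sketched in base R as follows. This is a minimal illustration rather than the code used for the analysis: the haversine_km() helper and the two sensor coordinates are hypothetical, and only the Melbourne central station coordinates come from the table. The geosphere package in the reference list provides distHaversine() as a packaged alternative.

```r
# Great-circle (haversine) distance in kilometres between two points.
# radius_km is the mean Earth radius; the helper name is hypothetical.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Central station for Melbourne, taken from the table above.
centre_lat <- -37.814
centre_lon <- 144.963

# Hypothetical sensor coordinates, for illustration only.
sensors <- data.frame(
  location = c("Sensor A", "Sensor B"),
  lat = c(-37.764, -37.614),
  long = c(144.963, 144.963)
)

# Distance of each sensor from the central station, and the 10 km rule.
sensors$distance_km <- haversine_km(centre_lat, centre_lon,
                                    sensors$lat, sensors$long)
sensors$within_10km <- sensors$distance_km <= 10
print(sensors)
```

Sensors flagged FALSE in within_10km would be excluded from that city's analysis.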

The detailed map showing the exact sensor locations within each city’s 10 km radius, along with the pollutants each sensor captures, can be found below.

Melbourne

location City pm10 pm25 co no2 o3 so2
Alphington Melbourne Yes Yes Yes Yes Yes Yes
Altona North Melbourne No Yes No Yes No Yes
Brooklyn Melbourne Yes Yes No No No No
Footscray Melbourne Yes Yes No No No No
Kingsville Melbourne Yes Yes No No No No
Melbourne CBD Melbourne No Yes No No No No
Spotswood Melbourne No Yes No No No No

Perth

location City pm10 pm25 co no2 o3 so2
Subiaco Perth Yes Yes No No No No

Sydney

location City pm10 pm25 co no2 o3 so2
Anzac Memorial Sydney Yes Yes No No No No
Luna Lewisham Sydney Yes Yes No No No No
Sydney, Australia Sydney Yes Yes No No No No

Brisbane

location City pm10 pm25 co no2 o3 so2
Brisbane CBD Brisbane Yes Yes No No No No
Cannon Hill Brisbane Yes Yes No No Yes No
Rocklea Brisbane Yes Yes No No Yes No
South Brisbane Brisbane Yes Yes No No No No
Woolloongabba Brisbane Yes Yes No No No No

Adelaide

location City pm10 pm25 co no2 o3 so2
CBD West Adelaide Yes Yes No No No No

Data description

Data Collection:

  • The data was collected using the airpurifyr package, which retrieves air quality measurements from the OpenAQ API. OpenAQ is an open-source platform that aggregates air quality data from government and research organizations worldwide.

  • Data is collected via sensors located in various cities and locations in Australia. The get_measurements_for_location() function is used to pull data based on city, location, and time range.

Variables:

  • location_id: Identifier for the location of the sensor.
  • location: Name of the sensor’s location.
  • parameter: Type of air quality measurement (e.g., PM2.5, NO2).
  • value: The pollutant concentration.
  • date_utc: Timestamp when the measurement was recorded.
  • unit: Measurement unit (typically µg/m³).
  • lat: Latitude of the sensor.
  • long: Longitude of the sensor.
  • country: Country code (e.g., AU for Australia).

Initial data analysis & Exploratory data analysis

Checking the data type

Below is a glimpse at the dataset.

tibble [44,522 × 14] (S3: tbl_df/tbl/data.frame)
 $ location_id                : int [1:44522] 5521 5521 5521 5521 5521 5521 5521 5521 5521 5521 ...
 $ location                   : chr [1:44522] "Rocklea" "Rocklea" "Rocklea" "Rocklea" ...
 $ parameter                  : chr [1:44522] "pm25" "pm25" "pm25" "pm25" ...
 $ value                      : num [1:44522] 20.6 20.7 20.7 20.4 20.4 20 19.5 19.2 18.9 18.5 ...
 $ date_utc                   : POSIXct[1:44522], format: "2024-08-31 00:00:00" ...
 $ unit                       : chr [1:44522] "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
 $ lat                        : num [1:44522] -27.5 -27.5 -27.5 -27.5 -27.5 ...
 $ long                       : num [1:44522] 153 153 153 153 153 ...
 $ country                    : chr [1:44522] "AU" "AU" "AU" "AU" ...
 $ City                       : chr [1:44522] "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
 $ State                      : chr [1:44522] "Queensland" "Queensland" "Queensland" "Queensland" ...
 $ Population                 : num [1:44522] 2545882 2545882 2545882 2545882 2545882 ...
 $ Central_Station_Coordinates: chr [1:44522] "-27.467, 153.017" "-27.467, 153.017" "-27.467, 153.017" "-27.467, 153.017" ...
 $ Air_Station_Count          : num [1:44522] 5 5 5 5 5 5 5 5 5 5 ...

After examining the structure of the dataset and the data types of its variables, we can infer that the data types are appropriately assigned based on the nature of each variable. For example, timestamps are stored as POSIXct for time-based analysis, and pollutant values are stored as numeric.

Aggregate by hours:

Below is a glimpse at the dataset.

# A tibble: 6 × 14
  location_id location parameter value date_utc           
        <int> <chr>    <chr>     <dbl> <dttm>             
1        5521 Rocklea  pm25       20.6 2024-08-31 00:00:00
2        5521 Rocklea  pm25       20.7 2024-08-30 23:00:00
3        5521 Rocklea  pm25       20.7 2024-08-30 22:00:00
4        5521 Rocklea  pm25       20.4 2024-08-30 21:00:00
5        5521 Rocklea  pm25       20.4 2024-08-30 20:00:00
6        5521 Rocklea  pm25       20   2024-08-30 19:00:00
# ℹ 9 more variables: unit <chr>, lat <dbl>, long <dbl>,
#   country <chr>, City <chr>, State <chr>,
#   Population <dbl>, Central_Station_Coordinates <chr>,
#   Air_Station_Count <dbl>

The parameter values are recorded hourly, so the data is already aggregated by hour. However, some hours are missing, so we add a row for each missing hour with the value set to NA for further analysis.
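
One way to add the missing hours is sketched below in base R. It is an assumption about how the step could be done, not the code behind this report: complete_hours() is a hypothetical helper, and the toy data reuses a few Rocklea values from the glimpse above with the 21:00 reading removed on purpose to create a gap.

```r
# Add a row with value = NA for every hour that has no reading.
# Each location/parameter pair present in df is completed over [from, to].
complete_hours <- function(df, from, to) {
  hours <- seq(from, to, by = "hour")
  series <- unique(df[c("location", "parameter")])
  # merge() with by = NULL returns the Cartesian product (a cross join).
  grid <- merge(series, data.frame(date_utc = hours), by = NULL)
  key <- function(d) {
    paste(d$location, d$parameter,
          format(d$date_utc, "%Y-%m-%d %H:%M", tz = "UTC"))
  }
  # Hours with no match in df receive NA.
  grid$value <- df$value[match(key(grid), key(df))]
  grid[order(grid$location, grid$parameter, grid$date_utc), ]
}

# Toy data: the 21:00 reading is deliberately left out.
obs <- data.frame(
  location = "Rocklea",
  parameter = "pm25",
  date_utc = as.POSIXct(c("2024-08-30 19:00", "2024-08-30 20:00",
                          "2024-08-30 22:00"), tz = "UTC"),
  value = c(20.0, 20.4, 20.7)
)

completed <- complete_hours(
  obs,
  from = as.POSIXct("2024-08-30 19:00", tz = "UTC"),
  to = as.POSIXct("2024-08-30 22:00", tz = "UTC")
)
print(completed)
```

Only location and parameter pairs that already appear in the data are completed, so a sensor that never records a pollutant does not gain empty rows for it.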

Next, we check the total number of timestamp records for each parameter at each location. The data spans 60 days, which means 1440 (60 × 24) hourly records are expected for each parameter at each location. The table below shows how many records each parameter has at each location, along with that count as a percentage of the expected 1440.

parameter location timestamp_count records_proportion
co Alphington 1341 93
no2 Alphington 1340 93
no2 Altona North 1351 94
o3 Alphington 1343 93
o3 Cannon Hill 719 50
o3 Rocklea 719 50
pm10 Subiaco 1392 97
pm10 Alphington 1337 93
pm10 Anzac Memorial 1398 97
pm10 Brisbane CBD 719 50
pm10 Brooklyn 1417 98
pm10 CBD West 1398 97
pm10 Cannon Hill 719 50
pm10 Footscray 312 22
pm10 Kingsville 1316 91
pm10 Luna Lewisham 1393 97
pm10 Rocklea 719 50
pm10 South Brisbane 719 50
pm10 Sydney, Australia 1398 97
pm10 Woolloongabba 719 50
pm25 Subiaco 1392 97
pm25 Alphington 1411 98
pm25 Altona North 1403 97
pm25 Anzac Memorial 1398 97
pm25 Brisbane CBD 719 50
pm25 Brooklyn 1357 94
pm25 CBD West 1398 97
pm25 Cannon Hill 719 50
pm25 Footscray 1208 84
pm25 Kingsville 1307 91
pm25 Luna Lewisham 1393 97
pm25 Melbourne CBD 1414 98
pm25 Rocklea 719 50
pm25 South Brisbane 719 50
pm25 Spotswood 1387 96
pm25 Sydney, Australia 1398 97
pm25 Woolloongabba 719 50
so2 Alphington 1341 93
so2 Altona North 1351 94
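
The counts and proportions in the table above could be reproduced along these lines. This is a base R sketch under stated assumptions: count_records() is a hypothetical helper, the toy data is made up, and the expected number of records defaults to the 1440 hours mentioned earlier.

```r
# Count non-missing records per parameter and location, and express the
# count as a rounded percentage of the expected number of hourly records.
count_records <- function(df, expected = 60 * 24) {
  observed <- df[!is.na(df$value), ]
  counts <- aggregate(value ~ parameter + location, data = observed,
                      FUN = length)
  names(counts)[names(counts) == "value"] <- "timestamp_count"
  counts$records_proportion <- round(100 * counts$timestamp_count / expected)
  counts
}

# Made-up readings: five valid pm25 values, one valid and one missing pm10.
toy <- data.frame(
  parameter = c(rep("pm25", 5), rep("pm10", 2)),
  location = "Example Station",
  value = c(1, 2, 3, 4, 5, NA, 3)
)

# With 10 expected records, pm25 is 50% complete and pm10 is 10% complete.
toy_counts <- count_records(toy, expected = 10)
print(toy_counts)
```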

From the table above, we can infer that most parameters have incomplete hourly timestamps. PM10 and PM2.5 are recorded at the most locations (14 and 17 respectively), whereas the other pollutants are recorded at only a few (CO at one, NO2 and SO2 at two each, O3 at three). This limited coverage would introduce bias when comparing those pollutants across different areas.

Checking the missingness and outliers

We use the vis_miss() function from the visdat package to visualize the missingness of the dataset across variables, after completing the dataset with the missing timestamps. The data is not recorded for about 4% of the hourly timestamps.

To explore further the coverage of pollutants across cities, we will visualize the observation records on a time series plot to see which parameters are recorded across different cities and time intervals.

SO2, NO2, and CO have less coverage across locations, whereas PM2.5 and PM10 are recorded more consistently across the majority of locations, making them suitable for further analysis without significant imputation or data handling. In addition, Brisbane only has data recorded from 1 August to 1 September, so its results need careful handling in further analysis.

Data distribution:

We use box plots to get a glimpse of the distribution of each parameter across locations.

From the plot above, we can infer several key observations:

  • CO, NO2: There are multiple high outliers, which may represent extreme pollution events or possible sensor anomalies.
  • O3, PM10, and PM2.5: There are extreme negative outliers for these pollutants. Concentrations cannot be negative, so these are likely errors in the dataset.
  • SO2: Some high outliers were observed, likely due to limited records or inconsistencies in the dataset.

To ensure our analysis focuses on typical air quality levels, we remove outliers, that is, extreme values that could distort the results. The lower and upper bounds were determined using the interquartile range (IQR): any value below the lower quartile minus 1.5 times the IQR, or above the upper quartile plus 1.5 times the IQR, was considered an outlier.

By filtering out these outliers, we can focus on more accurate, typical pollutant readings, improving the quality of our analysis.
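
The IQR rule above can be sketched in base R as follows. The helper name remove_iqr_outliers() is hypothetical, and computing the bounds separately for each pollutant is an assumption, since the text does not say how the readings were grouped. The toy values are made up.

```r
# Keep only readings inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
# Bounds are computed per parameter (an assumption); NA values are dropped.
remove_iqr_outliers <- function(df) {
  keep <- ave(df$value, df$parameter, FUN = function(v) {
    q <- quantile(v, c(0.25, 0.75), na.rm = TRUE)
    iqr <- q[2] - q[1]
    lower <- q[1] - 1.5 * iqr
    upper <- q[2] + 1.5 * iqr
    as.numeric(!is.na(v) & v >= lower & v <= upper)
  })
  df[keep == 1, ]
}

# Made-up readings with one negative and one very high value.
toy <- data.frame(
  parameter = "pm25",
  value = c(10, 11, 12, 13, 14, -999, 500)
)

# Quartiles are 10.5 and 13.5, so the bounds are 6 and 18.
cleaned <- remove_iqr_outliers(toy)
print(cleaned)
```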

Results

As mentioned above, we focused on two key pollutants, PM10 and PM2.5, from the air quality data. For each city, we calculated the median value of these pollutants based on all the sensors that measured them. This gave us a representative value of air quality in each city. Finally, we organized the data so that each city has its own row, with separate columns showing the median levels of PM10 and PM2.5.

City pm10 pm25
Adelaide 2.6 2.3
Brisbane 16.8 7.1
Melbourne 13.3 3.2
Perth 2.2 2.1
Sydney 3.3 3.1
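
A table of this shape, with one row per city and one column per pollutant, could be produced along these lines. This is a base R sketch, not the original code: city_medians() is a hypothetical helper and the readings are made up.

```r
# Median PM10 and PM2.5 per city, pooled over all sensors in the city,
# returned with one row per city and one column per pollutant.
city_medians <- function(df) {
  pm <- df[df$parameter %in% c("pm10", "pm25"), ]
  med <- tapply(pm$value, list(pm$City, pm$parameter), median, na.rm = TRUE)
  data.frame(City = rownames(med),
             pm10 = med[, "pm10"],
             pm25 = med[, "pm25"],
             row.names = NULL)
}

# Made-up readings for two hypothetical cities.
toy <- data.frame(
  City = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  parameter = c("pm10", "pm10", "pm10", "pm25", "pm25",
                "pm10", "pm25", "pm25", "pm25"),
  value = c(1, 3, 5, 2, 4, 10, 6, 8, 100)
)

toy_medians <- city_medians(toy)
print(toy_medians)
```

Using the median rather than the mean keeps a single extreme reading, such as the 100 in city B, from dominating the city value.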

The bar chart above compares the concentrations of two key pollutants, PM10 (in orange) and PM2.5 (in blue), across five Australian cities: Adelaide, Brisbane, Melbourne, Perth, and Sydney. Brisbane shows the highest levels of both pollutants (16.8 for PM10 and 7.1 for PM2.5). Melbourne has a similarly high PM10 level (13.3), but its PM2.5 level (3.2) is close to Sydney’s (3.1). Adelaide and Perth have the lowest concentrations of both pollutants, with Sydney slightly above them.

This data prompts the need to define what we mean by “clean air” for the purpose of this analysis. Below, we discuss how to establish a criterion for classifying air quality based on the observed pollutant levels. This will help in determining which cities have cleaner air relative to others based on the levels of PM10 and PM2.5.

Based on the comparison of PM10 and PM2.5 levels across the cities, we decided to measure “clean air” for each city by taking the halfway point between its median PM10 and median PM2.5 concentrations. Cities with a lower midpoint are classed as having cleaner air, and cities with a higher midpoint as more polluted.

Using this criterion, we create a new plot to visualize which cities have cleaner air and which do not, based on their pollutant levels relative to this defined midpoint. This approach will allow us to make clearer distinctions between cities regarding their air quality.

City PM10 (Median) PM25 (Median) Midway Point
Perth 2.2 2.1 2.1
Adelaide 2.6 2.3 2.5
Sydney 3.3 3.1 3.2
Melbourne 13.3 3.2 8.3
Brisbane 16.8 7.1 11.9
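
The midway point and the resulting ranking can be checked with a few lines of base R. The sketch below starts from the rounded medians reported in the results table, so a midpoint may differ in the last digit from the value shown above, which was presumably computed from unrounded medians.

```r
# Rounded city medians copied from the results table.
medians <- data.frame(
  City = c("Adelaide", "Brisbane", "Melbourne", "Perth", "Sydney"),
  pm10 = c(2.6, 16.8, 13.3, 2.2, 3.3),
  pm25 = c(2.3, 7.1, 3.2, 2.1, 3.1)
)

# Midway point: the arithmetic mean of the two medians for each city.
medians$midpoint <- (medians$pm10 + medians$pm25) / 2

# A lower midpoint means cleaner air under this criterion.
ranked <- medians[order(medians$midpoint), ]
print(ranked)
```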

Based on the analysis, we can conclude that Perth has the cleanest air using the halfway point between the median PM10 and median PM2.5 values, narrowly surpassing Adelaide. However, it is important to note that both Adelaide and Perth have only one sensor each, which may introduce bias. While this is the best data available, future studies would benefit from each city having a similar number of sensors to ensure more accurate comparisons.

References


Wickham, H. (2019). *tidyverse: Easily install and load the 'tidyverse'*. R package version 1.3.0. https://CRAN.R-project.org/package=tidyverse

Wickham, H., & Bryan, J. (2019). *readxl: Read excel files*. R package version 1.3.1. https://CRAN.R-project.org/package=readxl

Tierney, N., & Cook, D. (2022). *visdat: Visualising whole data frames*. R package version 0.5.3. https://CRAN.R-project.org/package=visdat

Tierney, N., & Cook, D. (2022). *naniar: Data structures, summaries, and visualisations for missing data*. R package version 0.6.1. https://CRAN.R-project.org/package=naniar

Moritz, S. (2022). *imputeTS: Time series missing value imputation*. R package version 3.2. https://CRAN.R-project.org/package=imputeTS

Wickham, H. (2019). *rvest: Easily harvest (scrape) web data*. R package version 0.3.5. https://CRAN.R-project.org/package=rvest

Wickham, H. (2022). *conflicted: An alternative conflict resolution strategy*. R package version 1.0.4. https://CRAN.R-project.org/package=conflicted

Numbats. (n.d.). *airpurifyr: Air pollution modeling for Australia*. Retrieved from https://numbats.github.io/airpurifyr/

Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis*. Springer-Verlag New York. https://ggplot2.tidyverse.org

Kahle, D., & Wickham, H. (2013). *ggmap: Spatial visualization with ggplot2*. The R Journal, 5(1), 144-161. https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny*. Chapman and Hall/CRC. https://plotly-r.com

Hijmans, R. J. (2019). *geosphere: Spherical trigonometry*. R package version 1.5-10. https://CRAN.R-project.org/package=geosphere

Zhu, H. (2021). *kableExtra: Construct complex table with 'kable' and pipe syntax*. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra

OpenAI. (2023). *ChatGPT (October 2023 version) [Large language model]*. https://chat.openai.com